[TEMP] feat: add torch profiler for the function update_weights #10
Conversation
JD-ETH left a comment
Vibe coding is convenient, and often even elegant, lol. But it tends to be very verbose and redundant, with fail-safes --- let's try to be concise and careful here, and prioritize human readability first; it's better to throw errors and crash runs immediately for the profiling.
slime/backends/megatron_utils/update_weight/update_weight_from_remote.py
slime/backends/megatron_utils/update_weight/update_weight_from_remote.py
```python
parser.add_argument(
    "--profile-update-weight-start",
    type=int,
    default=0,
```
let's just have the start default to 1, and log everything afterwards? We only do 3 training steps anyway, and this profiler will only be used for our profiling configs.
you mean we could skip the first step?
```python
)
parser.add_argument("--check-weight-update-equal", action="store_true")
parser.add_argument(
    "--use-pytorch-profiler-update-weight",
```
does this impact performance of the run? If not, let's just default it to on
If we're not targeting merging this feature into the official slime repo, I think it's OK. I'll also tag this PR with [TEMP] and we'll revert it in the end.
```python
    default=0,
    help="After enabling PyTorch profiler for weight update operations, start profiling from this point. Requires --tensorboard-dir to be set.",
)
parser.add_argument(
```
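Putting the flags under discussion together, the wiring might look like the sketch below. This is illustrative only: the `store_true` form of the enable flag and the start default of 1 follow the review suggestions and are assumptions, and the hard failure when `--tensorboard-dir` is missing follows the review's preference for crashing runs immediately rather than silently no-opping.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # Flag names come from the diff above; types and defaults are assumptions.
    parser.add_argument("--use-pytorch-profiler-update-weight", action="store_true")
    parser.add_argument("--profile-update-weight-start", type=int, default=1)
    parser.add_argument("--tensorboard-dir", type=str, default=None)
    return parser

def validate(args):
    # Fail fast, per the review: crash the run immediately instead of
    # profiling to nowhere.
    if args.use_pytorch_profiler_update_weight and args.tensorboard_dir is None:
        raise ValueError(
            "--use-pytorch-profiler-update-weight requires --tensorboard-dir"
        )
    return args

args = validate(build_parser().parse_args(
    ["--use-pytorch-profiler-update-weight", "--tensorboard-dir", "/tmp/tb"]
))
```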
or keep it, just like the profiler of slime itself?
This is the interesting part. Slime and Megatron-LM share the same argument `use-pytorch-profiler` (/root/Megatron-LM/megatron/training/arguments.py). We can also find its usage in `TrainProfiler` in /root/slime/slime/utils/profile_utils.py:
```python
class TrainProfiler:
    def __init__(self, args):
        self.args = args
        self._torch_profiler_overall = None
        self._memory_profiler_overall = None
        if args.use_pytorch_profiler and ("train_overall" in args.profile_target):
            self._torch_profiler_overall = _create_torch_profiler(args, name="train_overall")
        if args.record_memory_history and ("train_overall" in args.profile_target):
            self._memory_profiler_overall = _BaseMemoryProfiler.create(args)
            self._memory_profiler_overall.start()
```
Basically, when `--use-pytorch-profiler` is enabled, it records all Python functions between steps, and the trace gets quite large (>100 MB) if we want the mapping between Python functions and CPU/GPU occupancy. Besides that, there are too many redundant parts in this profiler, since all we care about is updating weights. That's why I created a separate function-level profiler.
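A function-scoped profiler along those lines could be sketched as follows. This is not the PR's actual implementation: `profiled_update_weights` and its signature are hypothetical, and the start-step gating and trace file name follow the review suggestion and the PR description respectively.

```python
import os

import torch
from torch.profiler import ProfilerActivity, profile

def profiled_update_weights(update_fn, step, rank, out_dir, start_step=1):
    """Profile a single update_weights call instead of whole train steps."""
    if step < start_step:  # skip early steps, per the review suggestion
        return update_fn(), None
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities) as prof:
        result = update_fn()
    # Trace name follows the pattern from the PR description;
    # open the file at https://ui.perfetto.dev/
    trace = os.path.join(
        out_dir, f"update_weights_call{step}_rank_{rank}.pt.trace.json.gz"
    )
    prof.export_chrome_trace(trace)
    return result, trace
```

Scoping the profiler to the single call keeps each trace small and avoids recording everything between steps.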
Updated @JD-ETH
Based on the comments, mark this PR as [TEMP]; it will be reverted in the future.
Add a torch profiler for the update_weights part, gated by the flag `--use-pytorch-profiler-update-weight`. Users can find the file `update_weights_call{step}_rank_{gpu_id}.pt.trace.json.gz` and open it with `https://ui.perfetto.dev/`.
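The emitted file is a gzip-compressed Chrome-format trace JSON. As a quick sanity check before loading it into Perfetto, something like the helper below could be used (the helper name is hypothetical; it accepts both the `"traceEvents"` object form and a bare event array, since Chrome trace files come in either shape):

```python
import gzip
import json

def is_chrome_trace(path):
    # Chrome traces are either a JSON object containing "traceEvents"
    # or a bare JSON array of events.
    with gzip.open(path, "rt") as f:
        data = json.load(f)
    return (isinstance(data, dict) and "traceEvents" in data) or isinstance(data, list)
```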